Skip to content

fix: keep large LOAD external scan multi-CN#24855

Merged
heni02 merged 2 commits into
matrixorigin:mainfrom
aptend:fix-load-external-stats-multicn
Jun 7, 2026
Merged

fix: keep large LOAD external scan multi-CN#24855
heni02 merged 2 commits into
matrixorigin:mainfrom
aptend:fix-load-external-stats-multicn

Conversation

@aptend

@aptend aptend commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24846

What this PR does / why we need it:

This fixes LOAD external scan stats so large LOAD jobs keep row/cardinality semantics for Cost, Outcnt, TableCnt, and BlockNum, while preserving Cost * Rowsize as the input-size hint used by external scan parallel sizing.

Previously the LOAD stats used input bytes as Cost with Rowsize=1 and forced BlockNum=1 / TableCnt=1. That can make large CSV/TBL LOAD choose AP_ONECN instead of the expected multi-CN AP path.

Tests:

CGO_CFLAGS="-I$(pwd)/cgo -I$(pwd)/thirdparties/install/include" go test ./pkg/sql/plan -count=1

@aptend aptend requested review from aunjgr and ouyuanning as code owners June 4, 2026 15:19
Copilot AI review requested due to automatic review settings June 4, 2026 15:19
@qodo-code-review

Copy link
Copy Markdown

Qodo reviews are paused for this user.

Troubleshooting steps vary by plan Learn more →

On a Teams plan?
Reviews resume once this user has a paid seat and their Git account is linked in Qodo.
Link Git account →

Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center?
These require an Enterprise plan - Contact us
Contact us →

@matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Jun 4, 2026
@aptend aptend marked this pull request as draft June 4, 2026 15:19
@mergify mergify Bot added the kind/bug Something isn't working label Jun 4, 2026
@XuPeng-SH

Copy link
Copy Markdown
Contributor

再补一个实质问题:

现在 makeLoadExternalStats 用的是 GetRowSizeFromTableDef(tableDef, true) * 0.8 来估算 file-based LOAD 的 rowSize,但在 LOAD DATA ... (col, @var, ...) 这条路径里,传进去的 tableDef 已经在 getExternalWithColListProject / bind_load.go 里被裁成了只包含真实目标列的 schema,而 @var 这种输入字段不会出现在这个 tableDef 里。

结果就是:实际输入行里明明还有额外字段,新的统计却只按目标列宽来算,rowSize 会被明显低估,进而把 rowCount / Cost / Outcnt / BlockNum 系统性放大。这个路径不是理论上的,仓库里现有用例就有:test/distributed/cases/load_data/load_data.sql 里多处都在测 (@col1, col2) / (col1, @col2)

对这类 LOAD,执行路径阈值会被更早打穿,可能把本来不该进 multi-CN 的任务提前推到 multi-CN。也就是说,这次改动把“按真实输入行宽估算”的目标,在带 @var 的 column list 场景下又重新打偏了。

@XuPeng-SH

Copy link
Copy Markdown
Contributor

看了最新更新后,前面 BlockNum 的 off-by-one 和 inline CSV 少算换行这两个点已经修掉了,但我还看到两个残留问题:

  1. IGNORE n LINES 的首行采样可能仍然读到 header
    现在 estimateLoadRowsizeFromFirstLine 会按传入的 offset 去读首行做 rowSize 估算,但 offset = IgnoredLines(...) 只在 stmt.Param.Tail.IgnoredLines > 0 && stmt.Param.Parallel && noCompress && !stmt.Param.Local 时才会计算。也就是说,只要这条 LOAD 一开始没有开 Parallel,即使声明了 IGNORE 1 LINES,rowSize 采样仍然会从文件头开始,拿到 header 而不是首条数据行。

    这里的 stats 仍然会影响 GetExecType()Cost * Rowsize 的并行度估算,所以这不是纯测试问题。建议把 header offset 的计算从 Parallel 条件里解耦,或者让采样逻辑自己跳过 ignored lines。

  2. 文件被不必要地排除在首行采样之外
    estimateLoadRowsizeFromFirstLine 现在要求 loadLinesTerminatedBy(param) == "\n",这会把显式 LINES TERMINATED BY \r\n 的 CSV 直接打回 schema fallback。\n\n 但 readExternalFirstLineSize 本身就是按 ReadString(n) 读的,天然可以把 \r\n 一起算进来;而且仓库里现有 LOAD DATA case 也明确支持 \r\n。所以这条 gate 会让一类常见文本文件继续落在偏差更大的 schema 估算上,和这次 PR 想修正的方向不一致。

@aptend aptend force-pushed the fix-load-external-stats-multicn branch from 3308b8e to 1641216 Compare June 7, 2026 15:47

@XuPeng-SH XuPeng-SH left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

最新更新把我前面关注的点都补齐了:

  • BlockNum 边界的 off-by-one 改成了真正的 ceil
  • inline CSV 的行宽估算现在会把 terminator 算进去;
  • file-based text load 新增了首行采样,并且已经补上了 IGNORE LINES / \r\n 这些我上轮卡住的残留点;
  • 相关单测也跟着补到了边界和采样分支。

这版我这边看下来可以过。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

kind/bug Something isn't working size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants